Generating a Morphological Lexicon of Organization Entity Names
Abstract
This paper describes the methods used to generate a morphological lexicon of organization entity names in Croatian. The resource is intended for two primary tasks: template-based natural language generation and named entity identification. The main obstacles to generating the lexicon are the high level of inflection in Croatian and the low linguistic quality of the primary resource, which contains the named entities in normal form. The problem is divided into two subproblems, one for single-word and one for multi-word expressions. The single-word problem is solved by training a supervised learning algorithm called linear successive abstraction. With existing common-language morphological resources and two simple hand-crafted rules backing up the algorithm, an accuracy of 98.70% on the test set is achieved. The multi-word problem is solved through a semi-automated process applied to the multi-word entities occurring among the first 10,000 named entities. The generated multi-word lexicon will be used for natural language generation only, while named entity identification for multi-word entities will be solved algorithmically in forthcoming research. The single-word lexicon is capable of handling both tasks.
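The single-word step amounts to assigning an inflectional paradigm to an unseen name and then expanding that paradigm into word forms. The sketch below illustrates the general idea with a plain longest-suffix backoff guesser. It is a simplified stand-in, not the linear successive abstraction algorithm used in the paper, which combines estimates from successively more general contexts rather than picking the single most specific one. The paradigm labels, the suffix rules, and the tiny training set are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical paradigms: each rule is (suffix to strip, suffix to append).
PARADIGMS = {
    "A": [("a", "a"), ("a", "e"), ("a", "u")],
    "C": [("", ""), ("", "a"), ("", "u")],
}

def train(examples, max_suffix=4):
    """Count paradigm labels for every word-final suffix up to max_suffix characters."""
    counts = defaultdict(Counter)
    for word, paradigm in examples:
        for n in range(0, min(max_suffix, len(word)) + 1):
            counts[word[len(word) - n:]][paradigm] += 1
    return counts

def guess(counts, word, max_suffix=4):
    """Return the most frequent paradigm of the longest suffix seen in training."""
    for n in range(min(max_suffix, len(word)), -1, -1):
        suffix = word[len(word) - n:]
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return None  # only reached when the training set is empty

def inflect(word, rules):
    """Apply each (strip, append) rule whose strip suffix matches the word."""
    return [word[:len(word) - len(strip)] + add
            for strip, add in rules if word.endswith(strip)]

# Toy training data (assumed, not taken from the paper's resource).
TRAINING = [("banka", "A"), ("tvornica", "A"), ("institut", "C"), ("fond", "C")]
MODEL = train(TRAINING)

for name in ("agencija", "zavod"):
    paradigm = guess(MODEL, name)
    print(name, paradigm, inflect(name, PARADIGMS[paradigm]))
```

A lexicon entry in this sketch is simply the pair of a normal form and its guessed paradigm. The hand-crafted rules and common-language resources mentioned in the abstract would act as additional checks around such a guesser.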
Similar resources
Assigning Inflectional Paradigms to Named Entities by Linear Successive Abstraction
This paper describes how a supervised learning method is used for assigning inflectional paradigms to organizational named entities as the main prerequisite for generating a morphological lexicon of these entities. An inflectional paradigm consists of a set of rules for generating all forms of a lexicon entry. A morphological lexicon consists of lexicon entries and their corresponding forms. Th...
Generating a Resource for Products and Brandnames Recognition. Application to the Cosmetic Domain
The named entity recognition task needs high-quality, large-scale resources. In this paper, we present RENCO, a rule-based system focused on the recognition of entities in the cosmetic domain (brandnames, product names, ...). RENCO has two main objectives: 1) generating resources for named entity recognition; 2) mining new named entities by relying on the previously generated resources. In order to ...
Corpus-Based Lexeme Ranking for Morphological Guessers
Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names or compounds of such words. To add new words to a morphological lexicon, we need to determine their base form and indicate their inflectional paradigm. A base form and a paradigm define a lexeme. In this article, we evaluate a lexicon-based method augmented with data from a corpus or the...
1 All in a Day's Week
This paper presents a Frame Semantics analysis of calendric terms in English, Hebrew, and German, demonstrating the inextricable relationship between morphology and semantics. Here, we consider how Frame Semantics can be extended to morphological analysis. Traditionally, the morpheme has been taken as the smallest unit of meaning in linguistic analysis, with generative word formation rules gene...
Integrating Punctuation Rules and Naïve Bayesian Model for Chinese Creation Title Recognition
Creation titles, i.e. titles of literary and/or artistic works, comprise over 7% of named entities in Chinese documents. They are the fourth largest category of named entities in Chinese, after personal names, location names, and organization names. However, they have rarely been mentioned or studied. Chinese title recognition is challenging for the following reasons. There are few internal fea...
Publication date: 2008